Week 10 - Day 3
Explanatory model - what we explored yesterday: check each candidate variable for statistical significance and add it to the model, continuing until the model is built. You can explain the model, and you know how it arrives at its predictions.
Predictive model - black-box models; strong predictive power, but little statistical interpretability.
Model building is a systematic approach, but it is not an exact science; it is equally an art, and therefore domain knowledge is important.
library(tidyverse)
library(car)
library(modelr)
library(GGally)
Getting rid of the census variable
Checking for missing values (the NAs in type are dropped before the modelling)
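The code for this step isn't shown in the notes; a minimal sketch, assuming the data is the Prestige dataset from the car package:

```r
# A sketch of the cleaning step (assumption: data is car::Prestige)
library(car)
library(tidyverse)

prestige_trim <- Prestige %>%
  select(-census)            # get rid of the census variable

sum(is.na(prestige_trim))    # check for missing values (the NAs sit in type)
```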
summary(prestige_trim)
education income women prestige type
Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80 bc :44
1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23 prof:31
Median :10.540 Median : 5930 Median :13.600 Median :43.60 wc :23
Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83 NA's: 4
3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
Choosing first predictor
# prestige is the response variable; looking for its first predictor
prestige_trim %>%
ggpairs(aes(color = type, alpha = 0.5))
# from these plots, it looks like education, income, and type
Looking at education first
mod1a <- lm(prestige ~ education, data = prestige_trim)
mod1a
Call:
lm(formula = prestige ~ education, data = prestige_trim)
Coefficients:
(Intercept) education
-10.841 5.388
Interpretation of this:
prestige = -10.841 + 5.388 * education
Which means -10.841 is where the line intercepts the y-axis, and prestige then increases by 5.388 for every additional year of education. For example, at education = 10, the predicted prestige is -10.841 + 5.388 * 10 ≈ 43.0.
summary(mod1a)
Call:
lm(formula = prestige ~ education, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-21.605 -6.151 0.366 6.565 17.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.8409 3.5285 -3.072 0.00276 **
education 5.3884 0.3168 17.006 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.578 on 96 degrees of freedom
Multiple R-squared: 0.7508, Adjusted R-squared: 0.7482
F-statistic: 289.2 on 1 and 96 DF, p-value: < 2.2e-16
Diagnostic plot
par(mfrow = c(2, 2))
plot(mod1a)
Task
Adding a second predictor into the model (using the residuals)
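The notes don't show how this residual dataframe was built; a plausible sketch using modelr, assuming mod1a from above:

```r
# Add mod1a's residuals to the data, then drop the response and the
# predictor already in the model, so the remaining variables can be
# compared against what mod1a leaves unexplained
prestige_remain_resid <- prestige_trim %>%
  add_residuals(mod1a) %>%
  select(-c("prestige", "education"))
```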
prestige_remain_resid %>%
ggpairs(aes(color = type, alpha = 0.5))
Second model development: adding in income or type
mod2a <- lm(prestige ~ education + income, data = prestige_trim)
summary(mod2a)
Call:
lm(formula = prestige ~ education + income, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-16.9367 -4.8881 0.0116 4.9690 15.9280
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.6210352 3.1162309 -2.446 0.0163 *
education 4.2921076 0.3360645 12.772 < 2e-16 ***
income 0.0012415 0.0002185 5.682 1.45e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.45 on 95 degrees of freedom
Multiple R-squared: 0.814, Adjusted R-squared: 0.8101
F-statistic: 207.9 on 2 and 95 DF, p-value: < 2.2e-16
mod2b <- lm(prestige ~ education + type, data = prestige_trim)
summary(mod2b)
Call:
lm(formula = prestige ~ education + type, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-19.410 -5.508 1.360 5.694 17.171
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.6982 5.7361 -0.470 0.6392
education 4.5728 0.6716 6.809 9.16e-10 ***
typeprof 6.1424 4.2590 1.442 0.1526
typewc -5.4585 2.6907 -2.029 0.0453 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.814 on 94 degrees of freedom
Multiple R-squared: 0.7975, Adjusted R-squared: 0.791
F-statistic: 123.4 on 3 and 94 DF, p-value: < 2.2e-16
The p-values for type are mixed: one dummy variable is above 0.05 and the other below. To check whether type as a whole is significant, we can use the anova() function.
anova(mod1a, mod2b)
Analysis of Variance Table
Model 1: prestige ~ education
Model 2: prestige ~ education + type
Res.Df RSS Df Sum of Sq F Pr(>F)
1 96 7064.4
2 94 5740.0 2 1324.4 10.844 5.787e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This shows that type IS statistically significant, so it could be used. However, we noted that income was the better second predictor, so we will proceed down that path.
prestige_remain_resid <- prestige_trim %>%
add_residuals(mod2a) %>%
select(-c("prestige", "education", "income"))
prestige_remain_resid %>%
ggpairs(aes(colour = type, alpha = 0.5))
There is still some variation explained by type, so we will use it as our third predictor.
mod3a <- lm(prestige ~ education + income + type, data = prestige_trim)
summary(mod3a)
Call:
lm(formula = prestige ~ education + income + type, data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-14.9529 -4.4486 0.1678 5.0566 18.6320
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6229292 5.2275255 -0.119 0.905
education 3.6731661 0.6405016 5.735 1.21e-07 ***
income 0.0010132 0.0002209 4.586 1.40e-05 ***
typeprof 6.0389707 3.8668551 1.562 0.122
typewc -2.7372307 2.5139324 -1.089 0.279
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.095 on 93 degrees of freedom
Multiple R-squared: 0.8349, Adjusted R-squared: 0.8278
F-statistic: 117.5 on 4 and 93 DF, p-value: < 2.2e-16
While the two type dummy variables are not individually significant here, the reference level is hidden in the intercept, so it is important to run the anova anyway.
anova(mod2a, mod3a)
Analysis of Variance Table
Model 1: prestige ~ education + income
Model 2: prestige ~ education + income + type
Res.Df RSS Df Sum of Sq F Pr(>F)
1 95 5272.4
2 93 4681.3 2 591.16 5.8721 0.003966 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
** BREAK **
Interactions
You can only include an interaction if the variables involved are already in the model as main effects.
Examining interactions
Education and income - both are continuous, so we condition one on the other
Adding some regression lines shows how the trend changes: as education increases, so does income, and the correlation begins to weaken.
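One way to condition one continuous variable on another is base R's coplot(); a sketch, assuming prestige_resid holds the residuals from the current model:

```r
# Plot residuals against income within overlapping bands of education;
# each panel gets its own regression line. If the slopes change across
# panels, an education:income interaction may be worth testing.
coplot(resid ~ income | education,
       panel = function(x, y, ...) {
         points(x, y)
         abline(lm(y ~ x), col = "blue")
       },
       data = prestige_resid)
```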
Looking at education and type
prestige_resid %>%
ggplot() +
aes(x = education, y = resid, color = type) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Income and type
prestige_resid %>%
ggplot() +
aes(x = income, y = resid, color = type) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Task: Test all 3 interactions in your model separately, and choose the best.
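The three candidate models aren't all shown in the notes; a sketch (the names mod4a and mod4b are assumptions, mod4c matches the summary below):

```r
# Add each two-way interaction to the three-predictor model separately
mod4a <- lm(prestige ~ education + income + type + education:income,
            data = prestige_trim)
mod4b <- lm(prestige ~ education + income + type + education:type,
            data = prestige_trim)
mod4c <- lm(prestige ~ education + income + type + income:type,
            data = prestige_trim)

# Compare the fits, e.g. by adjusted R-squared
summary(mod4a)$adj.r.squared
summary(mod4b)$adj.r.squared
summary(mod4c)$adj.r.squared
```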
summary(mod4c)
Call:
lm(formula = prestige ~ education + income + type + income:type,
data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-12.8720 -4.8321 0.8534 4.1425 19.6710
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.7272633 4.9515480 -1.359 0.1776
education 3.0396961 0.6003699 5.063 2.14e-06 ***
income 0.0031344 0.0005215 6.010 3.79e-08 ***
typeprof 25.1723873 5.4669586 4.604 1.34e-05 ***
typewc 7.1375093 5.2898177 1.349 0.1806
income:typeprof -0.0025102 0.0005530 -4.539 1.72e-05 ***
income:typewc -0.0014856 0.0008720 -1.704 0.0919 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.455 on 91 degrees of freedom
Multiple R-squared: 0.8663, Adjusted R-squared: 0.8574
F-statistic: 98.23 on 6 and 91 DF, p-value: < 2.2e-16
Imagine a manager asks: what has the most impact / where should we focus our efforts? Relative importance analysis is how to determine this.
lmg - the Lindeman, Merenda and Gold method, from the relaimpo package
library(relaimpo)
calc.relimp(mod4c, type = "lmg", rela = TRUE)
Response variable: prestige
Total response variance: 292.2358
Analysis based on 98 observations
6 Regressors:
Some regressors combined in groups:
Group type : typeprof typewc
Group income:type : income:typeprof income:typewc
Relative importance of 4 (groups of) regressors assessed:
type income:type education income
Proportion of variance explained by model: 86.63%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
type 0.39609310
income:type 0.04222622
education 0.30489921
income 0.25678147
Average coefficients for different model sizes:
1group 2groups 3groups 4groups
education 5.388407674 4.432450330 3.673166052 3.039696088
income 0.002843574 0.001321366 0.002518339 0.003134410
typeprof 32.321114370 15.598958947 25.528996354 25.172387319
typewc 6.716205534 0.854330301 8.121753937 7.137509272
income:typeprof NaN NaN -0.003178285 -0.002510174
income:typewc NaN NaN -0.002171217 -0.001485560
The variable with the highest metric is type, at roughly 40% of the explained variance (even though we only added it third). This would provide a good argument for a business to give weight to this variable.
library(lm.beta)
mod4c_beta <- lm.beta(mod4c)
summary(mod4c_beta)
Call:
lm(formula = prestige ~ education + income + type + income:type,
data = prestige_trim)
Residuals:
Min 1Q Median 3Q Max
-12.8720 -4.8321 0.8534 4.1425 19.6710
Coefficients:
Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept) -6.7272633 0.0000000 4.9515480 -1.359 0.1776
education 3.0396961 0.4887966 0.6003699 5.063 2.14e-06 ***
income 0.0031344 0.7752428 0.0005215 6.010 3.79e-08 ***
typeprof 25.1723873 0.6882988 5.4669586 4.604 1.34e-05 ***
typewc 7.1375093 0.1778589 5.2898177 1.349 0.1806
income:typeprof -0.0025102 -0.8493433 0.0005530 -4.539 1.72e-05 ***
income:typewc -0.0014856 -0.2036043 0.0008720 -1.704 0.0919 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.455 on 91 degrees of freedom
Multiple R-squared: 0.8663, Adjusted R-squared: 0.8574
F-statistic: 98.23 on 6 and 91 DF, p-value: < 2.2e-16
** LUNCH BREAK **
The curse of dimensionality
the more variables you add, the more difficult you make things for your model: the data becomes sparser in the higher-dimensional space
Ways to tackle this problem
Variable reduction - we’ve already done this, by selecting the relevant / best variables
Dimensionality reduction - reducing variables into components, e.g. 40% of one, 40% of another, and 20% of a 3rd variable
Variable reduction - there are 3 ways to do this:
- filter: assess each variable by some criterion, and filter it out if it doesn't meet it
- wrapper: fit models, check correlations and residuals, then decide. Forward selection (starting with nothing, then adding variables in) is the wrapper method used in the first lesson
- embedded: the selection happens inside the model-fitting algorithm itself (e.g. regularisation)
Dimensionality Reduction
Principal Component Analysis
Because it works on a numeric matrix, you have to do feature engineering first - you can't use categorical variables directly with PCA.
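For example, a factor would first need to be expanded into numeric dummy columns. A hypothetical illustration using base R's model.matrix() (the data frame here is made up):

```r
# Made-up example: expand a factor into numeric dummies before PCA
df <- data.frame(size  = factor(c("small", "medium", "large", "small")),
                 price = c(10, 20, 30, 12))

dummies   <- model.matrix(~ size - 1, data = df)  # one 0/1 column per level
pca_ready <- cbind(dummies, price = df$price)     # fully numeric matrix
```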
Principal Component Analysis Task
cleaning dataset
cars_data <- cars_data %>%
  dplyr::select(-c("vs", "am"))
The original call errored with "unused argument (-c("vs", "am"))" because dplyr's select() had been masked by another loaded package; namespacing the call as dplyr::select fixes this.
creating the PCA
cars_pca <- prcomp(cars_data, center = TRUE, scale. = TRUE)  # reconstructed call; scaling assumed, consistent with the balanced variance below
summary(cars_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413 0.2419 0.14896
Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167 0.0065 0.00247
Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103 0.9975 1.00000
library(factoextra)
fviz_pca_ind(cars_pca,
             repel = TRUE)
Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider increasing max.overlaps
Creating a biplot, which combines the individuals and variables plots
fviz_pca_biplot(cars_pca,
repel = TRUE,
col.var = "#00008b",
col.ind = "#d3d3d3")
Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider increasing max.overlaps
For comparison, a PCA on the unscaled data (reconstructed call) - without scaling, the variables with the largest raw variance dominate, and PC1 alone captures almost everything:
unscaled_pca <- prcomp(cars_data, center = TRUE, scale. = FALSE)
summary(unscaled_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 136.532 38.14735 3.06642 1.27492 0.90474 0.64734 0.3054 0.2859 0.2159
Proportion of Variance 0.927 0.07237 0.00047 0.00008 0.00004 0.00002 0.0000 0.0000 0.0000
Cumulative Proportion 0.927 0.99938 0.99985 0.99993 0.99997 0.99999 1.0000 1.0000 1.0000